Fast Failure Recovery in Distributed Graph Processing Systems
Authors
Abstract
Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes also increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
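The key idea in the abstract, spreading the lost part of the graph over surviving nodes while weighing computation against communication, can be illustrated with a minimal sketch. This is not the paper's partitioning algorithm, which the abstract does not specify; it is a simple greedy heuristic written for illustration. The function name `reassign_lost_vertices`, the `comm_weight` parameter, and the cost formula (current load plus weighted count of remote neighbors) are all assumptions.

```python
from collections import defaultdict


def reassign_lost_vertices(lost_vertices, edges, owner, survivors, comm_weight=1.0):
    """Greedily place each lost vertex on a surviving node (illustrative only).

    lost_vertices: vertices that lived on the failed node
    edges:         iterable of undirected (u, v) pairs
    owner:         dict mapping each surviving vertex to its node
    survivors:     nodes chosen to take part in recovery
    comm_weight:   hypothetical knob trading communication against load
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    load = {node: 0 for node in survivors}  # lost vertices assigned so far
    placement = {}

    # Place high-degree vertices first; break ties by name for determinism.
    for vertex in sorted(lost_vertices, key=lambda x: (-len(neighbors[x]), x)):

        def cost(node):
            remote = 0
            for nb in neighbors[vertex]:
                location = placement.get(nb, owner.get(nb))
                # Lost neighbors that are not placed yet have no location.
                if location is not None and location != node:
                    remote += 1
            # Assumed cost model: computation (load) + weighted communication.
            return load[node] + 1 + comm_weight * remote

        best = min(sorted(survivors), key=cost)
        placement[vertex] = best
        load[best] += 1

    return placement


if __name__ == "__main__":
    plan = reassign_lost_vertices(
        lost_vertices=["x", "y"],
        edges=[("x", "a"), ("y", "b"), ("x", "y")],
        owner={"a": "n1", "b": "n2"},
        survivors=["n1", "n2"],
    )
    print(plan)
```

In this toy input each lost vertex ends up next to its surviving neighbor, so recovery work is split across both nodes instead of being replayed on a single replacement node. A real system would derive the cost terms from measured recomputation and message volumes rather than from vertex counts.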
Similar resources
Lightweight Fault Tolerance in Large-Scale Distributed Graph Processing
The success of Google’s Pregel framework in distributed graph processing has inspired a surging interest in developing Pregel-like platforms featuring a user-friendly “think like a vertex” programming model. Existing Pregel-like systems support a fault tolerance mechanism called checkpointing, which periodically saves computation states as checkpoints to HDFS, so that when a failure happens, co...
Asynchronous Logging and Fast Recovery for a Large-Scale Distributed In-Memory Storage
Large-scale interactive applications and online graph analytic processing require very fast data access to many small data objects. DXRAM addresses these challenges by keeping all data always in memory of potentially many nodes aggregated in a data center. Data loss in case of node failures is prevented by an asynchronous logging on flash disks. In this paper we present the architecture of a no...
A new Shuffled Genetic-based Task Scheduling Algorithm in Heterogeneous Distributed Systems
Distributed systems such as Grid and Cloud Computing provision web services to their users all over the world. One of the most important concerns that service providers encounter is handling the total cost of ownership (TCO). A large part of TCO is related to power consumption due to inefficient resource management. The task scheduling module, as a key component, can have a drastic impact on both user...
A Protocol for Consistent Checkpointing Recovery for Time-Critical Distributed Database Systems
This paper presents a checkpointing scheme which effectively copes with media failures for a distributed database system (DDBS), which employs the timestamp ordering scheme for concurrency control. In our scheme, normal transactions are executed during the checkpointing process without any interruption. The state of the database taken as a checkpoint by all sites in the system is consistent, so...
Implementation and Performance of Transparent Rollback-recovery in Manetho
We describe the implementation and performance of rollback-recovery in Manetho. During failure-free operation, Manetho maintains an antecedence graph which records the "happened before" relation between certain events in the distributed computation. The antecedence graph is used in combination with checkpointing and volatile sender-based message logging to simultaneously achieve low failure-fre...
Journal title:
- PVLDB
Volume 8, Issue
Pages -
Publication date 2014